ViReMa: a virus recombination mapper of next-generation sequencing data characterizes diverse recombinant viral nucleic acids

Abstract

Background
Genetic recombination is a tremendous source of intrahost diversity in viruses and is critical for their ability to rapidly adapt to new environments or fitness challenges. While viruses are routinely characterized using high-throughput sequencing techniques, characterizing the genetic products of recombination in next-generation sequencing data remains a challenge. Viral recombination events can be highly diverse and variable in nature, including simple duplications and deletions, or more complex events such as copy/snap-back recombination, intervirus or intersegment recombination, and insertions of host nucleic acids. Due to the variable mechanisms driving virus recombination and the different selection pressures acting on the progeny, recombination junctions rarely adhere to simple canonical sites or sequences. Furthermore, numerous different events may be present simultaneously in a viral population, yielding a complex mutational landscape.

Findings
We have previously developed an algorithm called ViReMa (Virus Recombination Mapper) that bootstraps the bowtie short-read aligner to capture and annotate a wide range of recombinant species found within virus populations. Here, we have updated ViReMa to provide an “error density” function designed to accurately detect recombination events in the longer reads now routinely generated by the Illumina platforms and provide output reports for multiple types of recombinant species using standardized formats. We demonstrate the utility and flexibility of ViReMa in different settings to report deletion events in simulated data from Flock House virus, copy-back RNA species in Sendai viruses, short duplication events in HIV, and virus-to-host recombination in an archaeal DNA virus.

"Additionally, these aligners may be overly permissive in requiring mapped nucleotides on either side of a recombination junc
5) Are the reads clipped above a particular depth coverage? This feature is especially critical in repetitive viral content, such a
No, there is no clipping of reads above a particular depth. ViReMa attempts to map all the reads as exhaustively as possible.
6) Have some of these viruses been enriched for targeted capture? Please, provide this information in the manuscript. In som behaves. Moreover, some aligners may have problems in older versions with these depth values.
HIV reads were obtained by sequencing cDNA amplicons derived through targeted RT-PCR of the gag-pol region of HIV. To c
"These datasets were obtained by sequencing cDNA amplicons derived from template-specific RT-PCR of the gag-pol genes."
The SeV samples were not template-enriched.
The STIV samples were obtained from clarified supernatant, as described in the main text, and as such are highly enriched fo
We have corrected the instances of e.g. "300'000" (U.K. format) to "300,000" (U.S. format). This coverage is indeed high, bu input read count would simply down-sample the number of recombination events detected proportionally to the total coverag recombination frequency (i.e. abundance of reads mapped to a recombination junction relative to the number of reads founds
"ViReMa does not make any assumptions with regard to expected read coverage or coverage bias over a reference sequence in the BED files. Recombination 'frequencies' can be calculated by comparing the number of reads mapping to a recombinatio target-capture probes and PCR biases result in uneven genome coverage, it is possible that read counts for specific recombina by visualizing the output SAM files in alignment visualization tools such as Tablet (50)."
7) It was unclear which types of duplications were flagged and if the pipeline covers them.
ViReMa can detect and report short duplications in a viral genome. Examples of this are provided in detail in the HIV section, end would be reported as simple recombination events in the BED file, but where the donor site is downstream of the accepto
8) How does the pipeline deal with contaminants?
There is no explicit handling of contaminants by the ViReMa pipeline. Rather, these must be dealt with as appropriate by the u removed by aligning to known sources of contamination (e.g. mycoplasma genome). To clarify this point and provide guidanc
"The expected input for ViReMa is FASTQ data from short-read sequencing platforms such as Illumina. Typically, input data sh strategies (e.g. using BBDuk from the BBMap suite RRID:SCR_016965) or removing reads that align to known contaminant g
9) This article states that the pipeline works for viral sequences. However, the tests used do not include large genomes. What
Thank you for this comment. We have certainly considered the application of ViReMa to larger and more complex genomes an (as might be present in the boundaries of repetitive elements), additional supporting information is required to fully resolve re area of ongoing research that we feel would be beyond the scope of the current manuscript and may indeed illustrate the lim
"We have utilized ViReMa here to uncover evidence of duplications and insertions in clinical HIV isolates in response to antiret junction break-points. However, large duplications that exceed the length of short sequence reads or repetitive regions of so across complex and variable repetitive elements in vaccinia virus, each containing unique point-mutations (72)."
10) While looking for recombination events, especially fusions with the host, what are the differences between sequenced viral
Thank you for this comment. On a read-by-read sequence information level, there is no way to distinguish between these two Integration events would be reported in the exact same way in ViReMa.
This issue is discussed in the final paragraph of the S
"A "Virus-Host-Fusion" event may either describe a direct fusion of viral and host nucleic acids (such as described below for S whether the input material sequenced derived from the host nucleus) and upon downstream interpretation of the results repo
11) The authors state that the pipeline provides accurate results. Regarding the calculation of accuracy values, several good p
a) https://www.sciencedirect.com/science/article/
b) https://www.sciencedirect.com/science/article/
Thank you for alerting us to these helpful resources.
We have added the following text to emphasize this availability:
"Instructions on how to analyze the small example datasets and to validate correct installation are provided in the associated
Reviewer #2: In this work, Sotcheff et al. provide a comprehensive and nicely-written report about using the algorithm Virus R algorithm that, by accounting for the high-diversity nature of virus populations, can efficiently detect a wide range of virus re advances in NGS, including the read length and the significant increase in NGS library size and NGS-based experiments. Nota biological connotations. Overall, the paper used a robust analysis method and sufficient controls to clearly demonstrate the ca
We thank the reviewer for their positive assessment and time taken reviewing our manuscript.
Fig 2E is showing the gradual effect of the permissibility imposed by the error-density values, transforming the table
As suggested, we have generated a bar chart to replace the tabulated data in Fig 2E, and adjusted the figure legend accordin
"Bar-chart depicting the number of duplication events using different 'error density' parameters in the command-line. The par
2) At lines 500-501, the author found that the majority of reads mapped directly to the virus genome. Looking at the aligned
We have added the following text to the manuscript:

Minor comments 1) Since
"Given the large size of the dataset, the default '--Chunk' feature would have processed the reads in packets of 1 million read
3) At line 478, the authors stated: "The 'Reads' columns describe the number of reads at each particular nucleotide position",
No. This is the exact read count at the coordinates of the referenced template where the unique recombination event has take
4) Typos at line 206 "red", and at line 397 "(NL4-3)"
Corrected. Thank you!